# Setup
import dask.dataframe as dd
import pandas as pd
import numpy as np
import pyreadstat
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import interp1d
from dash import Dash, dcc, html, Input, Output, State
import plotly.graph_objects as go
import plotly.express as px
from dash.dash_table import DataTable
from sklearn.impute import KNNImputer
from shiny import App, ui, reactive, render
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import statsmodels.api as sm
import tempfile
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.cluster import KMeans
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import altair as altFinal Project
Group Members:
Name / Cnetid / Github_username / section
1.Pei-Chin Lu / Peichin / Pei0504 / section 4
Yuan Qi / yuanqi / freyaqi / section 4
Huiting(Aurora) Zhang / zhanght / aurorazhang688 / section 4
Data File
Due to file size limitations, the necessary data file for this project is hosted on Google Drive. You can download it using the following link:
[Download Original Data File](https://drive.google.com/file/d/1TGlCQULvQDOUOzWxdpDD60O2lJmcQer0/view?usp=sharing
Make sure to download this file and place it in the appropriate directory before running the project.
Research Question and the approach we took
Research in various fields has shown that residential mobility influences key aspects of how individuals think about themselves, interact with others, and perceive public rules. Based on this, our project primarily investigates the relationship between social mobility (e.g., economic opportunities, migration patterns, and educational access) and a range of personal characteristics. Specifically, we will pay attention to residents’ Happiness (HAPPY), Trust (TRUST), and Fairness (FAIR). These variables were selected because they represent essential aspects of individual well-being, interpersonal dynamics, and societal norms. Establishing reliable connections between these factors is crucial for designing effective public policies that enhance social creativity and public well-being.
To achieve this, we conduct a series of regression analyses. Additionally, to ensure that our variable selection is free from selection bias or “cherry-picking”(hand-picking variables to show favorable insights), we have implemented additional measures. We developed a method to objectively verify the validity of our variable selection, including the use of Exploratory Graph Analysis (EGA) and statistical tests, to classify variables as “Interested,” “Proximate,” or “Distal.” These measures and visualizations help enhance the credibility of our research and ensure that our conclusions are derived from objective data analysis. All these charts and visualizations will ultimately be presented in our Shiny app.
Literature Base:
A growing body of literature highlights the significant impact of residential mobility on both individual and cultural dynamics. At the individual level, residential mobility has been linked to increased individualism, a heightened sense of freedom, optimism, and well-being. At the group level, it fosters broader but shallower social networks, higher relational mobility, and greater trust in strangers.
A study Shifts in Residential Mobility Predict Shifts in Culture by Buttrick, Cha, and Oishi used quantitative methods to examine the positive impact of residential mobility on cultural variables such as trust, fairness, and happiness.
Research Objective
Building on these studies, we aim to revisit this topic using the methodology of Buttrick, Cha, and Oishi to verify the impact of residential mobility on key cultural dimensions. The results will offer valuable insights for shaping mobility-related policies and understanding their broader cultural implications.
Setup
Load Dataset
# Load data
gss_data, meta = pyreadstat.read_sav(r"F:\python2\Final-Project\Data\GSS7218_R3.sav")
labels_data = pd.read_excel(r"F:\python2\Final-Project\Data\suggestions_modified.xlsx")We utilize publicly available datasets such as the General Social Survey (GSS) from 1978 to 2018, which provides comprehensive data on social trends, public happiness, and socio-economic factors in the United States. In addition, we will incorporate data on U.S. Immigration and GDP, sourced from government and financial databases like the Federal Reserve Economic Data (FRED). These datasets will be merged and preprocessed within our Shiny app to ensure consistency and compatibility.
Immigration Data (DHS)
The yearly immidration population in the United States, as one covariate of our regression model. Data source: https://www.dhs.gov/immigration-statistics
Gross Domestic Product (GDP) Data (FRED)
Yearly gross domestic product data in the United States as one covariate of our regression model. Data source: https://fred.stlouisfed.org/
Residential Mobility Data (ACS)
Yearly mobility level in the United States, as independent variables of our regression model Data source: https://www.census.gov/programs-surveys/acs
In labels_data, we categorized each personal characteristic variable into one of the following types for user selection in the Shiny app interface: Likert Scale Variables, Binary Variables, Continuous Variables, Multichoice Variables, Administration Variables.
Clean Data
# Define columns to recode
column_recode_3to2 = ["COURTS", "RELITEN", "HELPFUL", "FAIR", "TRUST", "AGED", "FINALTER", "DIVLAW"]
column_recode_4othertomissing = ["GETAHEAD"]
column_recode_5othertomissing = ["PREMARSX", "XMARSEX", "HOMOSEX"]
# Recode columns
gss_data[column_recode_4othertomissing] = gss_data[column_recode_4othertomissing].replace(4, np.nan)
gss_data[column_recode_5othertomissing] = gss_data[column_recode_5othertomissing].replace(5, np.nan)
# Recode values in column_recode_3to2
gss_data[column_recode_3to2] = gss_data[column_recode_3to2].replace(3, 9992)
gss_data[column_recode_3to2] = gss_data[column_recode_3to2].replace(2, 9993)
gss_data[column_recode_3to2] = gss_data[column_recode_3to2].replace(9992, 2)
gss_data[column_recode_3to2] = gss_data[column_recode_3to2].replace(9993, 3)# Calculate the mean of each column by year
data_by_year = gss_data.groupby("YEAR").mean(numeric_only=True).reset_index()
print(data_by_year.columns)Index(['YEAR', 'ID', 'WRKSTAT', 'HRS1', 'HRS2', 'EVWORK', 'OCC', 'PRESTIGE',
'WRKSLF', 'WRKGOVT',
...
'NEISAFE', 'RLOOKS', 'RGROOMED', 'RWEIGHT', 'RHLTHEND', 'WTSS',
'WTSSNR', 'WTSSALL', 'VSTRAT', 'VPSU'],
dtype='object', length=6110)
We cleaned the GSS dataset by recoding specific variables to ensure consistency and handle missing values. Columns with specific values (e.g., 4 or 5) were recoded to NaN, while others had their values swapped to align with our analysis needs. After preprocessing, we calculated the mean of each column grouped by year and saved the results to a CSV file for further analysis.
# Calculate missing values per column in 'data_by_year'
nan_count_per_column = data_by_year.isna().sum()
# Mark the missing data of each column in labels_data
labels_data['missing_count'] = labels_data['variable'].map(nan_count_per_column)# Print the result
print(labels_data.head()) Likert Scale Variables Binary Variables Continuous Variables \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 1
4 0 0 1
Multichoice variables Administration variable variable \
0 0 1 YEAR
1 0 1 ID
2 1 0 WRKSTAT
3 0 0 HRS1
4 0 0 HRS2
label \
0 GSS year for this respondent
1 Respondent ID number
2 Labor force status
3 Number of hours worked last week
4 Number of hours usually work a week
labels Note Source \
0 NaN NaN NaN
1 NaN NaN NaN
2 IAP = 0; WORKING FULLTIME = 1; WORKING PARTTIM... NaN NaN
3 IAP = -1; 89+ hrs = 89; DK = 98; NA = 99 NaN NaN
4 IAP = -1; 89+ hrs = 89; DK = 98; NA = 99 NaN NaN
missing_count
0 0
1 0
2 0
3 1
4 1
We calculated the number of missing values for each column in data_by_year and mapped these counts to the corresponding variables in labels_data.
# Get the current column names
current_colnames = gss_data.columns
# Convert column names to lowercase and then capitalize the first letter
new_colnames = [col.lower().capitalize() for col in current_colnames]
# Prepare for EGA
gss_data_plot = gss_data.copy()
# Assign the new column names to the data frame
gss_data_plot.columns = new_colnames
# Set the first column name to "year"
gss_data_plot.columns.values[0] = "year"# Print the new column names to verify
print(gss_data_plot.columns)
print(gss_data_plot.head())Index(['year', 'Id', 'Wrkstat', 'Hrs1', 'Hrs2', 'Evwork', 'Occ', 'Prestige',
'Wrkslf', 'Wrkgovt',
...
'Neisafe', 'Rlooks', 'Rgroomed', 'Rweight', 'Rhlthend', 'Wtss',
'Wtssnr', 'Wtssall', 'Vstrat', 'Vpsu'],
dtype='object', length=6110)
year Id Wrkstat Hrs1 Hrs2 Evwork Occ Prestige Wrkslf Wrkgovt \
0 1972.0 1.0 1.0 NaN NaN NaN 205.0 50.0 2.0 NaN
1 1972.0 2.0 5.0 NaN NaN 1.0 441.0 45.0 2.0 NaN
2 1972.0 3.0 2.0 NaN NaN NaN 270.0 44.0 2.0 NaN
3 1972.0 4.0 1.0 NaN NaN NaN 1.0 57.0 2.0 NaN
4 1972.0 5.0 7.0 NaN NaN 1.0 385.0 40.0 2.0 NaN
... Neisafe Rlooks Rgroomed Rweight Rhlthend Wtss Wtssnr Wtssall \
0 ... NaN NaN NaN NaN NaN 1.0 1.0 0.4446
1 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
2 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
3 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
4 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
Vstrat Vpsu
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
[5 rows x 6110 columns]
# Remove the column at index 2 (equivalent to removing ID column)
gss_data_cleaned = gss_data.drop(gss_data.columns[1], axis=1)
# Group by the 'YEAR' column and calculate the mean for each year, ignoring NaN values
data_mean_by_year = gss_data_cleaned.groupby('YEAR').mean(numeric_only=True).reset_index()# Get the current column names
current_colnames = data_mean_by_year.columns
# Convert column names to lowercase and then capitalize the first letter
new_colnames = [col.lower().capitalize() for col in current_colnames]
# Assign the new column names to the DataFrame
data_mean_by_year.columns = new_colnames# Display the updated DataFrame with new column names
print(data_mean_by_year.head())
# Sve the dataframe as .csv
data_mean_by_year.to_csv(r"F:\python2\Final-Project\Data\data_mean_by_year.csv", index=False) Year Wrkstat Hrs1 Hrs2 Evwork Occ Prestige \
0 1972.0 3.455673 NaN NaN 1.246201 489.826327 38.485142
1 1973.0 3.573803 39.882503 40.766667 1.226502 495.518055 38.623123
2 1974.0 3.585580 39.828610 38.977273 1.199688 475.305310 39.447076
3 1975.0 3.575168 38.967277 41.400000 1.210692 487.310472 38.483321
4 1976.0 3.625083 39.659973 39.892857 1.220365 483.449040 38.400888
Wrkslf Wrkgovt Commute ... Neisafe Rlooks Rgroomed Rweight \
0 1.897099 NaN NaN ... NaN NaN NaN NaN
1 1.890126 NaN NaN ... NaN NaN NaN NaN
2 1.914328 NaN NaN ... NaN NaN NaN NaN
3 1.893917 NaN NaN ... NaN NaN NaN NaN
4 1.898669 NaN NaN ... NaN NaN NaN NaN
Rhlthend Wtss Wtssnr Wtssall Vstrat Vpsu
0 NaN 1.0 1.0 0.999827 NaN NaN
1 NaN 1.0 1.0 0.999916 NaN NaN
2 NaN 1.0 1.0 1.000213 NaN NaN
3 NaN 1.0 1.0 0.999844 7026.524161 1.504027
4 NaN 1.0 1.0 1.000063 7126.418946 1.493662
[5 rows x 6109 columns]
# Load the US Immigration data
us_immigration_data = pd.read_csv(r"F:\python2\Final-Project\Data\USImmigration.csv")
# Load the external datasets
us_gdp_data = pd.read_csv(r"F:\python2\Final-Project\Data\FREDGDP.csv")
# Load mobility data
mobility_data = pd.read_csv(r"F:\python2\Final-Project\Data\GSS level 2e.csv")
mobility_data = mobility_data[["year", "Mobility", "Mobilitystate"]]# Change the name of data_mean_by_year
gss_data1 = data_mean_by_year.copy()
# Rename the first column to "year"
gss_data1.rename(columns={gss_data1.columns[0]: "year"}, inplace=True)
# Uniform the data type of year
gss_data1['year'] = gss_data1['year'].astype(int)
mobility_data['year'] = mobility_data['year'].astype(int)# Merge mobility data with the gss data on "year"
gss_data1 = pd.merge(mobility_data, gss_data1, on="year", how="left")# Format the date and filter for October in the GDP data
us_gdp_data['Date'] = pd.to_datetime(us_gdp_data['DATE'], format='%Y-%m-%d')
us_gdp_data['Year'] = us_gdp_data['Date'].dt.year
us_gdp_data['Month'] = us_gdp_data['Date'].dt.month
us_gdp_data = us_gdp_data[us_gdp_data['Month'] == 10]# Verifying by printing the first few rows of each dataset
print(gss_data.head())
print(us_gdp_data.head())
print(us_immigration_data.head()) YEAR ID WRKSTAT HRS1 HRS2 EVWORK OCC PRESTIGE WRKSLF WRKGOVT \
0 1972.0 1.0 1.0 NaN NaN NaN 205.0 50.0 2.0 NaN
1 1972.0 2.0 5.0 NaN NaN 1.0 441.0 45.0 2.0 NaN
2 1972.0 3.0 2.0 NaN NaN NaN 270.0 44.0 2.0 NaN
3 1972.0 4.0 1.0 NaN NaN NaN 1.0 57.0 2.0 NaN
4 1972.0 5.0 7.0 NaN NaN 1.0 385.0 40.0 2.0 NaN
... NEISAFE RLOOKS RGROOMED RWEIGHT RHLTHEND WTSS WTSSNR WTSSALL \
0 ... NaN NaN NaN NaN NaN 1.0 1.0 0.4446
1 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
2 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
3 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
4 ... NaN NaN NaN NaN NaN 1.0 1.0 0.8893
VSTRAT VPSU
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
[5 rows x 6110 columns]
DATE GDP Date Year Month
3 1947-10-01 259.745 1947-10-01 1947 10
7 1948-10-01 280.366 1948-10-01 1948 10
11 1949-10-01 270.627 1949-10-01 1949 10
15 1950-10-01 319.945 1950-10-01 1950 10
19 1951-10-01 356.178 1951-10-01 1951 10
Year Number
0 2019 1031765
1 2018 1096611
2 2017 1127167
3 2016 1183505
4 2015 1051031
We standardized column names, cleaned data by removing unnecessary columns, and calculated yearly averages. External datasets like US Immigration, GDP, and mobility data were processed and merged with GSS data using the “year” column. GDP data was filtered for October observations, ensuring all datasets were aligned and ready for analysis.
# Define the columns to interpolate
columns_to_interpolate = [
"Fair", "Trust", "Happy", "Helpful", "Mobility", "Mobilitystate", "Aged",
"Attend", "Conarmy", "Conbus", "Conclerg", "Coneduc", "Confed", "Confinan",
"Conjudge", "Conlabor", "Conlegis", "Conmedic", "Conpress", "Consci", "Contv",
"Courts", "Divlaw", "Finalter", "Finrela", "Getahead", "Hapmar", "Health",
"Homosex", "Life", "Nataid", "Natarms", "Natcity", "Natcrime", "Natdrug",
"Nateduc", "Natenvir", "Natfare", "Natheal", "Natrace", "News", "Pornlaw",
"Premarsx", "Reliten", "Satjob", "Incom16", "Income", "Rincome", "Partyid",
"Polviews", "Natspac", "Fund", "Fund16", "Spfund", "Class", "Satfin", "Coop",
"Comprend", "Xmarsex"
]# Sort data by 'year' or the relevant column before interpolation
gss_data2 = gss_data1.sort_values(by='year')
# Check data types of columns to be interpolated
column_types = gss_data2[columns_to_interpolate].dtypes
print("Data types of columns to interpolate:")
print(column_types)
# Convert columns to numeric (force errors to NaN)
gss_data2[columns_to_interpolate] = gss_data2[columns_to_interpolate].apply(pd.to_numeric, errors='coerce')Data types of columns to interpolate:
Fair float64
Trust float64
Happy float64
Helpful float64
Mobility object
Mobilitystate object
Aged float64
Attend float64
Conarmy float64
Conbus float64
Conclerg float64
Coneduc float64
Confed float64
Confinan float64
Conjudge float64
Conlabor float64
Conlegis float64
Conmedic float64
Conpress float64
Consci float64
Contv float64
Courts float64
Divlaw float64
Finalter float64
Finrela float64
Getahead float64
Hapmar float64
Health float64
Homosex float64
Life float64
Nataid float64
Natarms float64
Natcity float64
Natcrime float64
Natdrug float64
Nateduc float64
Natenvir float64
Natfare float64
Natheal float64
Natrace float64
News float64
Pornlaw float64
Premarsx float64
Reliten float64
Satjob float64
Incom16 float64
Income float64
Rincome float64
Partyid float64
Polviews float64
Natspac float64
Fund float64
Fund16 float64
Spfund float64
Class float64
Satfin float64
Coop float64
Comprend float64
Xmarsex float64
dtype: object
# Apply interpolation using KNN imputer method
imputer = KNNImputer(n_neighbors=5)
gss_data2[columns_to_interpolate] = imputer.fit_transform(gss_data2[columns_to_interpolate])# Scaling the data (using StandardScaler from sklearn)
scaler = StandardScaler()
# Apply scaling to the selected columns
gss_data2[columns_to_interpolate] = scaler.fit_transform(gss_data2[columns_to_interpolate])# Merging datasets (left join)
gss_data3 = gss_data2.merge(us_immigration_data, how='left', left_on='year', right_on='Year') \
.rename(columns={'Number': 'Immigration'}) \
.merge(us_gdp_data, how='left', left_on='year', right_on='Year')
# Creating lag columns
columns_to_lag = [
"Fair", "Trust", "Happy", "Helpful", "Mobility", "Mobilitystate", "Aged", "Attend", "Conarmy",
"Conbus", "Conclerg", "Coneduc", "Confed", "Confinan", "Conjudge", "Conlabor", "Conlegis",
"Conmedic", "Conpress", "Consci", "Contv", "Courts", "Divlaw", "Finalter", "Finrela", "Getahead",
"Hapmar", "Health", "Homosex", "Life", "Nataid", "Natarms", "Natcity", "Natcrime", "Natdrug",
"Nateduc", "Natenvir", "Natfare", "Natheal", "Natrace", "News", "Pornlaw", "Premarsx", "Reliten",
"Satjob", "Incom16", "Income", "Rincome", "Partyid", "Polviews", "Natspac", "Fund", "Fund16",
"Spfund", "Class", "Satfin", "Coop", "Comprend", "Xmarsex"
]
# Creating lag columns for each column in columns_to_lag
for column in columns_to_lag:
gss_data3[f'{column}Lag'] = gss_data3[column].shift(1)
# Calculating statistics (can be adjusted as needed)
calculateStatistics_data = gss_data3print(calculateStatistics_data.head())
calculateStatistics_data.to_csv(r"F:\python2\Final-Project\Shiny app\calculateStatistics_data.csv", index=False) year Mobility Mobilitystate Wrkstat Hrs1 Hrs2 Evwork \
0 1972 0.921235 0.697366 3.455673 NaN NaN 1.246201
1 1973 0.795235 0.697366 3.573803 39.882503 40.766667 1.226502
2 1974 0.795235 0.697366 3.585580 39.828610 38.977273 1.199688
3 1975 0.795235 0.768203 3.575168 38.967277 41.400000 1.210692
4 1976 0.897610 0.945297 3.625083 39.659973 39.892857 1.220365
Occ Prestige Wrkslf ... PolviewsLag NatspacLag FundLag \
0 489.826327 38.485142 1.897099 ... NaN NaN NaN
1 495.518055 38.623123 1.890126 ... -0.724265 1.123140 -0.887056
2 475.305310 39.447076 1.914328 ... -1.439654 1.976468 -1.089271
3 487.310472 38.483321 1.893917 ... -2.160625 2.123863 -0.822573
4 483.449040 38.400888 1.898669 ... -2.494316 1.936353 -0.779984
Fund16Lag SpfundLag ClassLag SatfinLag CoopLag ComprendLag \
0 NaN NaN NaN NaN NaN NaN
1 -0.785336 -0.941270 -0.788360 -1.881082 -0.881419 1.359171
2 -1.096949 -0.931674 0.836226 -1.188343 -0.997771 1.017964
3 -1.114826 -0.521741 0.836656 -1.451932 -1.475067 0.808359
4 -1.558210 -1.028737 0.040794 -0.538404 -1.054593 0.489306
XmarsexLag
0 NaN
1 1.676054
2 2.222613
3 1.006868
4 1.600159
[5 rows x 6177 columns]
We prepared the data by interpolating missing values using KNN imputation, standardizing the variables for consistency, and merging external datasets like immigration and GDP data to add context. Lagged variables were created to capture temporal relationships, ensuring the dataset was ready for regression and statistical analysis.
One key difference between our analysis and the authors’ methodology is how we handled missing data. While the authors used interpolation to address missing values—preserving more of the time-series structure in the data—we used KNN imputation in our analysis. This difference in data handling might explain some of the variations in the results, particularly in terms of the short-term impacts of mobility on happiness and trust.
Shiny app
To further help other researchers who are interested in studying effect of mobility on culture factors in GSS dataset, we build up a shiny app.
By using this Shiny App, users can freely choose any variables they are interested in, run the analysis, and visualize the results, ensuring the research process remains unbiased and transparent. This approach will allow future studies to validate their findings and ensure their analysis is not influenced by subjective variable selection.
# Load data paths
cleaned_data_path = r"C:\Users\Aurora\Desktop\final project\calculateStatistics_data_test.csv"
labels_data_path = r"C:\Users\Aurora\Desktop\final project\labels_data.csv"
# Load cleaned data
def load_cleaned_data():
data = pd.read_csv(cleaned_data_path)
data.columns = data.columns.str.upper() # Ensure all columns are uppercase to match labels_data
data = data.apply(pd.to_numeric, errors="coerce") # Ensure all columns are numeric
return data
# Load labels data
def load_labels():
labels = pd.read_csv(labels_data_path)
return labels
# Load data
cleaned_data = load_cleaned_data()
labels_data = load_labels()UI Design
# Define UI
app_ui = ui.page_fluid(
ui.tags.style("""
.output-separator {
margin-top: 50px;
margin-bottom: 300px;
}
"""),
ui.h2("Dynamic Variable Selection, PCA Clustering, and Regression Analysis"),
ui.input_checkbox_group(
"variable_types",
"Select Variable Types:",
{
"Likert Scale Variables": "Likert Scale Variables",
"Binary Variables": "Binary Variables",
"Continuous Variables": "Continuous Variables",
"Multichoice Variables": "Multichoice Variables",
"Administration Variable": "Administration Variable",
},
selected=["Likert Scale Variables"],
),
ui.input_slider(
"missing_threshold",
"Select Missing Value Threshold:",
min=0,
max=int(labels_data["missing_count"].max()),
value=5,
step=1,
),
ui.output_text_verbatim("selected_variables"),
ui.input_action_button("generate_plot", "Generate PCA Plot"),
ui.div(ui.output_image("pca_plot"), class_="output-separator"),
ui.div(ui.output_text_verbatim("cluster_summary"), class_="output-separator"),
ui.input_action_button("generate_regression", "Generate Regression Results"),
ui.div(ui.output_table("regression_results"), class_="output-separator"),
ui.div(ui.output_image("regression_lines_filtered"), class_="output-separator"),
# Add buttons for histogram and raincloud plot
ui.input_action_button("generate_histogram", "Generate Histogram Plot"),
ui.div(ui.output_image("histogram_plot"), class_="output-separator"),
ui.input_action_button("generate_raincloud", "Generate Raincloud Plot"),
ui.div(ui.output_image("raincloud_plot"), class_="output-separator"),
)In the Shiny app interface, users select variables based on two main criteria: Variable Types and Missing Value Threshold for the period from 1972 to 2018. Users can filter variables by selecting specific types and adjusting the threshold slider. Based on the selected variables, the system generates three visualizations and a regression results table.
Server Logic
Altair–Static Plot
Before doing regression and cherry-picking test, let’s draw some pictures using altair to observe the patterns and trends between mobility and Happy, Trust and Fair first.
Focused variables over Time
# Melt the DataFrame to long format for Altair
gss_melted = calculateStatistics_data.melt(
id_vars=["year"], # Ensure 'year' exists in your DataFrame
value_vars=["Mobility", "Trust", "Fair", "Happy"],
var_name="Variable",
value_name="Value" # Renamed to 'Value' to avoid duplication with 'Variable'
)
# Define colors for the variables
color_scale = alt.Scale(
domain=["Mobility", "Trust", "Fair", "Happy"], # Ensure these match the melted variable names
range=["#e41a1c", "#377eb8", "#4daf4a", "#984ea3"]
)
# Create the line chart
line_chart = alt.Chart(gss_melted).mark_line().encode(
x=alt.X("year:O", title="Year"), # Ensure 'year' exists in your DataFrame
y=alt.Y("Value:Q", title="Variable Value"), # Updated to use 'Value' for Y-axis
color=alt.Color("Variable:N", scale=color_scale, title="Legend"),
strokeDash=alt.StrokeDash("Variable:N") # Different line styles for each variable
).properties(
width=800,
height=400,
title="Cultural Factors Over Time"
)
# Add trend lines for each variable
trend_lines = line_chart.transform_regression(
"year", "Value", groupby=["Variable"] # Updated to use 'Value' instead of non-existent Z-Scaled Scores
).mark_line(strokeDash=[5, 2], opacity=0.7)
# Combine the line chart and trend lines
final_chart = (line_chart + trend_lines).interactive()
# Display the chart
final_chartEffects of mobility on Trust, Fair, Happy
# Create a new column for Mobility categorized into Low, Medium, High
calculateStatistics_data['Mobility_Category'] = pd.cut(calculateStatistics_data['Mobility'],
bins=[-float('inf'), calculateStatistics_data['Mobility'].quantile(0.33),
calculateStatistics_data['Mobility'].quantile(0.66), float('inf')],
labels=["Low", "Medium", "High"])
# Boxplot: Show effect of Mobility on Trust, Fair, and Happy, grouped by Mobility categories
boxplot = alt.Chart(calculateStatistics_data).transform_fold(
fold=['Trust', 'Fair', 'Happy'], # Including Happy as a dependent variable
as_=['Variable', 'Value']
).mark_boxplot().encode(
x=alt.X('Variable:N', title='Dependent Variables'), # X-axis: dependent variables
y=alt.Y('Value:Q', title='Value'), # Y-axis: values of dependent variables
color=alt.Color('Variable:N', title='Legend'), # Different colors for each variable
column=alt.Column('Mobility_Category:N', title='Mobility Categories') # Boxplot separated by Mobility categories
).properties(
width=300,
height=300,
title='Boxplot: Effect of Mobility on Trust, Fair, and Happy'
)
boxplotgss_melted2 = calculateStatistics_data.melt(
id_vars=["Mobility"], # Use Mobility as the x-axis
value_vars=["Happy", "Trust", "Fair"], # Variables to plot as separate lines
var_name="Variable", # Column name for variable names
value_name="Value" # Column name for variable values
)
# Create the combined line chart
chart = alt.Chart(gss_melted2).mark_line(point=True).encode(
x=alt.X('Mobility:Q', title='Mobility'), # Mobility on the x-axis
y=alt.Y('Value:Q', title='Values'), # Combined values of Happy, Trust, Fair
color=alt.Color('Variable:N', title='Variable'), # Color each line by Variable
tooltip=[
alt.Tooltip('Variable:N', title='Variable'), # Show variable name
alt.Tooltip('Mobility:Q', title='Mobility'),
alt.Tooltip('Value:Q', title='Value')
]
).properties(
width=800,
height=400,
title='Mobility vs Happy, Trust, and Fair'
).interactive() # Enable interactivity for zoom and pan
chartRegression and Dynamic Plots
# Server logic
def server(input, output, session):
# Reactive function to filter variables
@reactive.Calc
def filtered_variables():
selected_types = input.variable_types()
missing_threshold = input.missing_threshold()
# Map variable types to columns in the labels_data
type_column_map = {
"Likert Scale Variables": "Likert Scale Variables",
"Binary Variables": "Binary Variables",
"Continuous Variables": "Continuous Variables",
"Multichoice Variables": "Multichoice variables",
"Administration Variable": "Administration variable",
}
selected_columns = [
type_column_map[typ] for typ in selected_types if typ in type_column_map
]
if not selected_columns:
return []
# Filter labels_data by selected types and missing value threshold
filtered_data = labels_data.loc[
(labels_data[selected_columns].sum(axis=1) > 0) &
(labels_data["missing_count"] <= missing_threshold)
]
final_variables = [var.upper() for var in filtered_data["variable"].tolist()]
if len(final_variables) == 0:
return None
return final_variables
@output
@render.text
def selected_variables():
variables = filtered_variables()
if variables is None:
return "No variables selected due to filtering."
return f"Selected Variables: {', '.join(variables)}"
@reactive.event(input.generate_plot)
def generate_pca_and_clusters():
selected_vars = filtered_variables()
if selected_vars is None or len(selected_vars) < 2:
return None
try:
data_for_pca = cleaned_data[selected_vars]
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_for_pca)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_imputed)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(pca_result)
return pca_result, clusters, selected_vars
except Exception as e:
print(f"Error in PCA or Clustering: {e}")
return None
@output
@render.image
def pca_plot():
pca_data = generate_pca_and_clusters()
if not pca_data:
return None
pca_result, clusters, variable_names = pca_data
try:
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
plt.figure(figsize=(10, 7))
for cluster in np.unique(clusters):
cluster_indices = np.where(clusters == cluster)[0]
plt.scatter(
pca_result[cluster_indices, 0],
pca_result[cluster_indices, 1],
label=f"Cluster {cluster + 1}",
alpha=0.7,
)
for idx in cluster_indices:
plt.annotate(
variable_names[idx], # 添加变量名称
(pca_result[idx, 0], pca_result[idx, 1]), # 在对应的 PCA 点上标注
fontsize=8, # 字体大小
alpha=0.7, # 透明度
)
plt.title("PCA Result and Clustering")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.savefig(tmpfile.name)
plt.close()
return {"src": tmpfile.name, "mime_type": "image/png"}
except Exception as e:
print(f"Error during plotting: {e}")
return None
@output
@render.text
def cluster_summary():
pca_data = generate_pca_and_clusters()
if not pca_data:
return "No clusters to summarize."
_, clusters, variable_names = pca_data
summary = []
for cluster in np.unique(clusters):
cluster_vars = [variable_names[idx] for idx in np.where(clusters == cluster)[0]]
summary.append(f"Cluster {cluster + 1}: {', '.join(cluster_vars)}")
return "\n".join(summary)
@reactive.event(input.generate_regression)
def generate_filtered_regression():
selected_vars = filtered_variables()
if selected_vars is None or len(selected_vars) < 2:
return None, None
regression_results = []
regression_lines = []
try:
for var in selected_vars:
lag_var = f"{var}Lag"
if lag_var not in cleaned_data.columns:
cleaned_data[lag_var] = cleaned_data[var].shift(1)
if "MOBILITYLAG" not in cleaned_data.columns:
return "Required variable 'MOBILITYLAG' not found in data.", None
regression_data = cleaned_data[[var, lag_var, "MOBILITYLAG"]].dropna()
if regression_data.empty:
continue
X = regression_data[["MOBILITYLAG", lag_var]]
y = regression_data[var]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
regression_results.append({
"Variable": var,
"R-squared": model.rsquared,
"MOBILITYLAG_coef": model.params["MOBILITYLAG"],
"Lag_coef": model.params[lag_var],
})
x_vals = np.linspace(regression_data["MOBILITYLAG"].min(), regression_data["MOBILITYLAG"].max(), 100)
y_vals = model.params["const"] + model.params["MOBILITYLAG"] * x_vals
regression_lines.append((x_vals, y_vals, var))
# 篩選圖形輸出需要的前 10 個變量
top_results = sorted(regression_results, key=lambda x: abs(x["MOBILITYLAG_coef"]), reverse=True)[:10]
top_vars = [result["Variable"] for result in top_results]
filtered_lines = [(x, y, var) for x, y, var in regression_lines if var in top_vars]
return pd.DataFrame(regression_results), filtered_lines
except Exception as e:
return f"Regression error: {e}", None
@output
@render.table
def regression_results():
results, _ = generate_filtered_regression()
if isinstance(results, str):
return pd.DataFrame({"Error": [results]})
return results # 返回完整的回歸結果
@output
@render.image
def regression_lines_filtered():
_, regression_lines = generate_filtered_regression()
if not regression_lines:
return None
try:
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
plt.figure(figsize=(10, 7))
for x_vals, y_vals, var in regression_lines:
plt.plot(x_vals, y_vals, label=var, linewidth=2)
plt.title("Top Variables by Coefficient")
plt.xlabel("MOBILITYLAG")
plt.ylabel("Fitted Values")
plt.legend()
plt.savefig(tmpfile.name)
plt.close()
return {"src": tmpfile.name, "mime_type": "image/png"}
except Exception as e:
print(f"Error during plotting: {e}")
return None
@reactive.event(input.generate_histogram)
@reactive.event(input.generate_raincloud)
def generate_selected_variable_clusters():
selected_vars = filtered_variables()
if selected_vars is None or len(selected_vars) < 2:
return None
try:
data_for_pca = cleaned_data[selected_vars]
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_for_pca)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_imputed)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(pca_result)
return pd.DataFrame({"variable": selected_vars, "cluster": clusters})
except Exception as e:
print(f"Error in PCA or Clustering for selected variables: {e}")
return None
def generate_all_variable_clusters():
try:
data_for_pca = cleaned_data.iloc[:, 1:] # Assuming the first column is metadata like "year"
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_for_pca)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_imputed)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(pca_result)
variables = cleaned_data.columns[1:] # Exclude the first column (metadata)
return pd.DataFrame({"variable": variables, "cluster": clusters})
except Exception as e:
print(f"Error in PCA or Clustering for all variables: {e}")
return None
def classify_variables():
cluster_info = generate_all_variable_clusters()
if cluster_info is None or cluster_info.empty:
return None
selected_clusters = generate_selected_variable_clusters()
if selected_clusters is None or selected_clusters.empty:
return None
interested_vars = selected_clusters["variable"].tolist()
selected_cluster_ids = selected_clusters["cluster"].unique()
proximate_vars = cluster_info.loc[
cluster_info["cluster"].isin(selected_cluster_ids) & ~cluster_info["variable"].isin(interested_vars),
"variable"
].tolist()
distal_vars = cluster_info.loc[
~cluster_info["cluster"].isin(selected_cluster_ids),
"variable"
].tolist()
return {
"interested": interested_vars,
"proximate": proximate_vars,
"distal": distal_vars
}
@output
@render.image
def histogram_plot():
classified = classify_variables()
if classified is None:
return None
grouped_stats = cleaned_data.melt(id_vars=["year"], var_name="variable", value_name="value")
grouped_stats["category"] = grouped_stats["variable"].apply(
lambda var: (
"Interested" if var in classified["interested"] else
"Proximate" if var in classified["proximate"] else
"Distal"
)
)
try:
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
plt.figure(figsize=(12, 8))
sns.barplot(
data=grouped_stats,
x="value",
y="variable",
hue="category",
dodge=False,
palette={"Interested": "#83CA55", "Proximate": "#F36F61", "Distal": "#34548B"}
)
plt.title("Mean Values of Variables by Category")
plt.xlabel("Mean Value")
plt.ylabel("Variable")
plt.legend(title="Category")
plt.savefig(tmpfile.name)
plt.close()
return {"src": tmpfile.name, "mime_type": "image/png"}
except Exception as e:
print(f"Error during histogram plot: {e}")
return None
@output
@render.image
def raincloud_plot():
classified = classify_variables()
if classified is None:
return None
grouped_stats = cleaned_data.melt(id_vars=["year"], var_name="variable", value_name="value")
grouped_stats["category"] = grouped_stats["variable"].apply(
lambda var: (
"Interested" if var in classified["interested"] else
"Proximate" if var in classified["proximate"] else
"Distal"
)
)
try:
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
plt.figure(figsize=(12, 8))
sns.violinplot(
data=grouped_stats,
x="category",
y="value",
scale="width",
inner="box",
linewidth=1.2,
palette={"Interested": "#83CA55", "Proximate": "#F36F61", "Distal": "#34548B"}
)
sns.stripplot(
data=grouped_stats,
x="category",
y="value",
size=4,
jitter=True,
alpha=0.7,
palette={"Interested": "#83CA55", "Proximate": "#F36F61", "Distal": "#34548B"}
)
plt.title("Raincloud Plot of Values by Category")
plt.xlabel("Category")
plt.ylabel("Value")
plt.savefig(tmpfile.name)
plt.close()
return {"src": tmpfile.name, "mime_type": "image/png"}
except Exception as e:
print(f"Error during raincloud plot: {e}")
return NoneThe first visualization graph involves Exploratory Graph Analysis (EGA), where variables are clustered using a structured process. Selected variables are standardized and subjected to Principal Component Analysis (PCA) for dimensionality reduction, mapping the data onto two principal components for visualization. These components are then clustered using the K-Means algorithm, dividing the variables into four distinct groups. The results are displayed in a PCA plot, showing the distribution of variables across clusters. From this plot, we know clearly about which cluster our selected variables belong or don’t belong to and what are the other variables in the same cluster as our selected variables so that we are able to do some further analysis.
The regression results table complements the visualizations by presenting coefficients, t-values, p-values, and R-squared values for the relationship between the selected variables and mobility. This dual approach enhances the analytical depth, offering users both statistical and visual insights into the data.
Then, we categorized variables into three distinct types for our analysis:
Interested/focal Variables: These are user-selected variables that represent specific areas of focus. They are explicitly identified by the user and belong to a particular cluster of interest.
Proximate Variables: These variables share the same cluster as the Interested Variables but are not directly chosen by the user. They are closely related but remain unselected.
Distal Variables: These variables belong to clusters other than the one containing the Interested Variables. They represent less direct or more distant relationships to the primary focus.
Based on these three categories, we created the histogram chart and the raincloud chart.
The first chart is a histogram plot that visualizes the absolute t-values for regression coefficients, categorizing variables into the three types: Interested, Proximate-other, and Distal-other. Each bar represents a variable, with the t-value magnitude indicating the strength of the relationship between that variable and the independent variable–mobility. The color coding helps distinguish the variable types based on their clustering.
The second chart is a raincloud plot comparing the distribution of absolute t-values across the same three clusters: Interested, Proximate-other, and Distal-other, along with an additional cluster, All-other. The plot shows the spread, central tendency, and individual data points of t-values for each cluster. The annotated Z and p-values compare the statistical significance of differences between clusters, highlighting potential variations in regression performance among the clusters. This detailed visualization further underscores the robustness of our approach, demonstrating that our variable selection methodology maintains objectivity and credibility.
Cherry-picking test–Dynamic Plots
# Load data paths
cleaned_data_path = r"F:\python2\Final-Project\Shiny app\calculateStatistics_data.csv"
labels_data_path = r"F:\python2\Final-Project\Shiny app\labels_data.csv"
# Load cleaned data
def load_cleaned_data():
data = pd.read_csv(cleaned_data_path)
data.columns = data.columns.str.upper() # Ensure all columns are uppercase
data = data.apply(pd.to_numeric, errors="coerce") # Ensure all columns are numeric
return data
# Load labels data
def load_labels():
labels = pd.read_csv(labels_data_path)
return labels
# Data loading
cleaned_data = load_cleaned_data()
labels_data = load_labels()
# Function to filter variables
def filter_variables(variable_types, missing_threshold):
type_column_map = {
"Likert Scale Variables": "Likert Scale Variables",
"Binary Variables": "Binary Variables",
"Continuous Variables": "Continuous Variables",
"Multichoice Variables": "Multichoice variables",
"Administration Variable": "Administration variable",
}
selected_columns = [type_column_map[typ] for typ in variable_types if typ in type_column_map]
if not selected_columns:
return []
filtered_data = labels_data.loc[
(labels_data[selected_columns].sum(axis=1) > 0) &
(labels_data["missing_count"] <= missing_threshold)
]
return [var.upper() for var in filtered_data["variable"].tolist()]# Perform PCA and clustering
def perform_pca_and_clustering(selected_vars):
if not selected_vars or len(selected_vars) < 2:
print("Insufficient variables selected for PCA.")
return None, None, None
data_for_pca = cleaned_data[selected_vars]
# Transpose data to cluster variables (columns)
data_for_pca = data_for_pca.T
# Fill missing values
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_for_pca)
# Standardize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_imputed)
# Perform PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)
# Perform clustering
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(data_scaled)
return pca_result, clusters, data_for_pca.index.tolist()
# Generate PCA plot
def generate_pca_plot(pca_result, clusters, variable_names):
if pca_result is None or clusters is None:
return
plt.figure(figsize=(10, 7))
for cluster in np.unique(clusters):
cluster_indices = np.where(clusters == cluster)[0]
plt.scatter(
pca_result[cluster_indices, 0],
pca_result[cluster_indices, 1],
label=f"Cluster {cluster + 1}",
alpha=0.7,
)
for idx in cluster_indices:
plt.annotate(
variable_names[idx],
(pca_result[idx, 0], pca_result[idx, 1]),
fontsize=8,
alpha=0.7,
)
plt.title("PCA Result and Clustering")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.tight_layout()
plt.show()# Perform regression
def perform_regression(selected_vars):
if not selected_vars or len(selected_vars) < 2:
print("Insufficient variables selected for regression.")
return None
results = []
for var in selected_vars:
lag_var = f"{var}Lag"
if lag_var not in cleaned_data.columns:
cleaned_data[lag_var] = cleaned_data[var].shift(1)
if "MOBILITYLAG" not in cleaned_data.columns:
print("Required variable 'MOBILITYLAG' not found in data.")
continue
regression_data = cleaned_data[[var, lag_var, "MOBILITYLAG"]].dropna()
if regression_data.empty:
continue
X = regression_data[["MOBILITYLAG", lag_var]]
y = regression_data[var]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
results.append({
"Variable": var,
"R-squared": model.rsquared,
"MOBILITYLAG_coef": model.params["MOBILITYLAG"],
"Lag_coef": model.params[lag_var],
"p-value (MOBILITYLAG)": model.pvalues["MOBILITYLAG"],
"t-value": model.tvalues["MOBILITYLAG"]
})
return pd.DataFrame(results)
# Generate cluster information
def generate_cluster_info(clusters, variable_names):
cluster_info = {f"Cluster {i + 1}": [] for i in np.unique(clusters)}
for var, cluster in zip(variable_names, clusters):
cluster_info[f"Cluster {cluster + 1}"].append(var)
return cluster_info# Plot regression fit lines
def plot_regression_fit_lines(cleaned_data, top_vars):
plt.figure(figsize=(12, 8))
for var, lag_var, model in top_vars:
x_vals = np.linspace(cleaned_data["MOBILITYLAG"].min(), cleaned_data["MOBILITYLAG"].max(), 100)
y_vals = model.params["const"] + model.params["MOBILITYLAG"] * x_vals
plt.plot(x_vals, y_vals, label=f"{var} (Coef: {model.params['MOBILITYLAG']:.2f})", alpha=0.8)
plt.title("Regression Fit Lines for Top 10 Variables", fontsize=14)
plt.xlabel("MOBILITYLAG", fontsize=12)
plt.ylabel("Fitted Value", fontsize=12)
plt.legend(loc="best", fontsize=10)
plt.tight_layout()
plt.show()# Generate barplot and raincloud plot
def generate_plots(regression_results, clusters, variable_names, interested_vars):
if regression_results is None:
return
# Assign categories
results = []
for var in variable_names:
if var in interested_vars:
category = "Interested"
elif clusters[variable_names.index(var)] == clusters[variable_names.index(interested_vars[0])]:
category = "Proximate"
else:
category = "Distal"
t_value = regression_results.loc[regression_results["Variable"] == var, "MOBILITYLAG_coef"].abs().values[0]
results.append({"variable": var, "abs_t_value": abs(t_value), "category": category})
results_df = pd.DataFrame(results)
# Define soft colors for categories
color_palette = {
"Interested": "#a6cee3", # Soft blue
"Proximate": "#b2df8a", # Soft green
"Distal": "#fb9a99", # Soft pink
}
# Barplot
plt.figure(figsize=(12, 8))
sns.barplot(
data=results_df,
x="abs_t_value",
y="variable",
hue="category",
order=results_df.sort_values("abs_t_value", ascending=False)["variable"],
palette=color_palette # Use the soft color palette
)
plt.title("Variables by Absolute T-Value and Category")
plt.xlabel("Absolute T-Value")
plt.ylabel("Variables")
plt.legend(title="Category")
plt.tight_layout()
plt.show()
# Raincloud Plot
plt.figure(figsize=(10, 6))
sns.violinplot(
data=results_df,
x="category",
y="abs_t_value",
hue="category",
inner="box",
palette=color_palette # Use the soft color palette
)
sns.stripplot(
data=results_df,
x="category",
y="abs_t_value",
color="gray", # Use neutral color for strip points
size=2,
alpha=0.5,
jitter=True,
dodge=True
)
plt.title("Raincloud Plot: Absolute T-Values by Category")
plt.ylabel("Absolute T-Value for Regression Coefficient")
plt.xlabel("Category")
plt.legend(title="Category")
plt.tight_layout()
plt.show()# Main execution
if __name__ == "__main__":
variable_types = ["Likert Scale Variables"]
missing_threshold = 5
# Filter variables
selected_vars = filter_variables(variable_types, missing_threshold)
print(f"Selected Variables: {selected_vars}")
# Perform PCA and clustering
pca_result, clusters, variable_names = perform_pca_and_clustering(selected_vars)
# Display cluster information
cluster_info = generate_cluster_info(clusters, variable_names)
for cluster, vars_in_cluster in cluster_info.items():
print(f"{cluster}: {vars_in_cluster}")
# Generate PCA plot
generate_pca_plot(pca_result, clusters, variable_names)
# Perform regression
regression_results = perform_regression(selected_vars)
print(regression_results)
# Select top 5 positive and 5 negative variables by MOBILITYLAG_coef
regression_results["abs_coef"] = regression_results["MOBILITYLAG_coef"].abs()
top_5_positive = regression_results.sort_values("MOBILITYLAG_coef", ascending=False).head(5)
top_5_negative = regression_results.sort_values("MOBILITYLAG_coef", ascending=True).head(5)
top_10_vars = pd.concat([top_5_positive, top_5_negative])
# Prepare data for fit-line plot
top_vars_for_fit = []
for _, row in top_10_vars.iterrows():
var = row["Variable"]
lag_var = f"{var}Lag"
regression_data = cleaned_data[[var, lag_var, "MOBILITYLAG"]].dropna()
X = regression_data[["MOBILITYLAG", lag_var]]
y = regression_data[var]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
top_vars_for_fit.append((var, lag_var, model))
# Plot regression fit lines
plot_regression_fit_lines(cleaned_data, top_vars_for_fit)
# Generate barplot and raincloud plot
generate_plots(regression_results, clusters, variable_names, ["HAPPY", "TRUST", "FAIR"])Selected Variables: ['INCOM16', 'INCOME', 'RINCOME', 'PARTYID', 'POLVIEWS', 'NATSPAC', 'NATENVIR', 'NATHEAL', 'NATCITY', 'NATCRIME', 'NATDRUG', 'NATEDUC', 'NATRACE', 'NATARMS', 'NATAID', 'NATFARE', 'COURTS', 'FUND', 'ATTEND', 'RELITEN', 'FUND16', 'SPFUND', 'HAPPY', 'HAPMAR', 'HEALTH', 'LIFE', 'HELPFUL', 'FAIR', 'TRUST', 'CONFINAN', 'CONBUS', 'CONCLERG', 'CONEDUC', 'CONFED', 'CONLABOR', 'CONPRESS', 'CONMEDIC', 'CONTV', 'CONJUDGE', 'CONSCI', 'CONLEGIS', 'CONARMY', 'AGED', 'SATJOB', 'CLASS', 'SATFIN', 'FINALTER', 'FINRELA', 'DIVLAW', 'PREMARSX', 'XMARSEX', 'HOMOSEX', 'PORNLAW', 'NEWS', 'COOP', 'COMPREND']
C:\Users\Aurora\Anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:870: FutureWarning:
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
C:\Users\Aurora\Anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1382: UserWarning:
KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
Cluster 1: ['INCOM16', 'POLVIEWS', 'CONLABOR', 'CLASS', 'FINRELA', 'DIVLAW', 'COOP']
Cluster 2: ['NATENVIR', 'NATHEAL', 'NATCITY', 'HAPPY', 'HEALTH', 'CONFINAN', 'CONBUS', 'CONJUDGE', 'SATFIN', 'FINALTER', 'PORNLAW']
Cluster 3: ['NATSPAC', 'NATEDUC', 'NATRACE', 'NATARMS', 'NATAID', 'NATFARE', 'COURTS', 'ATTEND', 'LIFE', 'FAIR', 'CONARMY', 'AGED', 'SATJOB', 'XMARSEX', 'COMPREND']
Cluster 4: ['INCOME', 'RINCOME', 'PARTYID', 'NATCRIME', 'NATDRUG', 'FUND', 'RELITEN', 'FUND16', 'SPFUND', 'HAPMAR', 'HELPFUL', 'TRUST', 'CONCLERG', 'CONEDUC', 'CONFED', 'CONPRESS', 'CONMEDIC', 'CONTV', 'CONSCI', 'CONLEGIS', 'PREMARSX', 'HOMOSEX', 'NEWS']

Variable R-squared MOBILITYLAG_coef Lag_coef p-value (MOBILITYLAG) \
0 INCOM16 0.349632 0.605932 0.021833 0.000327
1 INCOME 0.857104 -0.011820 0.897893 0.881496
2 RINCOME 0.852369 -0.016146 0.906253 0.851047
3 PARTYID 0.369032 -0.110403 0.560070 0.397747
4 POLVIEWS 0.256422 0.052881 0.506091 0.702383
5 NATSPAC 0.809107 0.261831 0.718413 0.013159
6 NATENVIR 0.291161 -0.121389 0.524541 0.373820
7 NATHEAL 0.359107 -0.132212 0.576263 0.310057
8 NATCITY 0.291323 -0.178681 0.483206 0.203033
9 NATCRIME 0.669806 -0.245629 0.626415 0.079623
10 NATDRUG 0.436751 -0.142801 0.577174 0.354858
11 NATEDUC 0.744171 0.034190 0.830840 0.720470
12 NATRACE 0.395229 0.259725 0.510744 0.082441
13 NATARMS 0.287628 0.449530 0.203132 0.003523
14 NATAID 0.752158 0.415447 0.563172 0.006596
15 NATFARE 0.399279 0.104576 0.591984 0.436708
16 COURTS 0.909077 0.413472 0.603531 0.004730
17 FUND 0.823022 -0.591361 0.375862 0.000417
18 ATTEND 0.823380 0.481285 0.459414 0.001223
19 RELITEN 0.851276 -0.411798 0.575858 0.011417
20 FUND16 0.466834 -0.412057 0.365759 0.008044
21 SPFUND 0.792352 -0.477446 0.467316 0.005218
22 HAPPY 0.545444 -0.704478 0.076012 0.000012
23 HAPMAR 0.237398 -0.174746 0.381176 0.239730
24 HEALTH 0.795703 -0.248964 0.736401 0.018568
25 LIFE 0.683288 0.588858 0.305798 0.000272
26 HELPFUL 0.129724 -0.391851 -0.136246 0.015426
27 FAIR 0.565535 0.460902 0.365767 0.004787
28 TRUST 0.581649 -0.348985 0.449649 0.031051
29 CONFINAN 0.593875 -0.228475 0.598375 0.101989
30 CONBUS 0.618772 -0.437587 0.407847 0.008682
31 CONCLERG 0.411215 -0.282758 0.428433 0.068023
32 CONEDUC 0.482008 -0.273609 0.502081 0.052554
33 CONFED 0.404906 -0.517333 0.213210 0.001382
34 CONLABOR 0.387771 0.513518 0.223179 0.000959
35 CONPRESS 0.851634 0.023306 0.925501 0.827054
36 CONMEDIC 0.762794 -0.272328 0.638587 0.061434
37 CONTV 0.841204 -0.055180 0.866860 0.551921
38 CONJUDGE 0.449008 -0.530020 0.233343 0.000886
39 CONSCI 0.017842 -0.135122 -0.073562 0.403979
40 CONLEGIS 0.809888 -0.378347 0.592758 0.002253
41 CONARMY 0.575449 0.677775 0.135499 0.000269
42 AGED 0.679948 0.327903 0.575656 0.017723
43 SATJOB 0.206936 0.363701 0.168566 0.033374
44 CLASS 0.745311 0.427836 0.495030 0.004023
45 SATFIN 0.426422 -0.118120 0.576344 0.385252
46 FINALTER 0.411221 -0.214999 0.502936 0.143049
47 FINRELA 0.391620 0.320080 0.411119 0.032022
48 DIVLAW 0.764618 0.270276 0.770787 0.021546
49 PREMARSX 0.879038 -0.231999 0.719013 0.025257
50 XMARSEX 0.563396 0.028309 0.711835 0.822504
51 HOMOSEX 0.919319 -0.328389 0.684967 0.053532
52 PORNLAW 0.677463 -0.624512 0.289227 0.000073
53 NEWS 0.899935 -0.273282 0.740811 0.062181
54 COOP 0.493158 0.160747 0.655171 0.171569
55 COMPREND 0.462921 0.186006 0.566009 0.167465
t-value
0 3.905978
1 -0.149961
2 -0.188917
3 -0.854165
4 0.384663
5 2.586662
6 -0.898695
7 -1.027227
8 -1.292656
9 -1.795381
10 -0.935297
11 0.360188
12 1.778210
13 3.087961
14 2.855156
15 0.785086
16 2.979775
17 -3.825530
18 3.462603
19 -2.643054
20 -2.779450
21 -2.943280
22 -4.937239
23 -1.192177
24 -2.446972
25 3.965800
26 -2.522706
27 2.975296
28 -2.229571
29 -1.670982
30 -2.750034
31 -1.871967
32 -1.993683
33 -3.420145
34 3.545997
35 0.219817
36 -1.920562
37 -0.599598
38 -3.572728
39 -0.842846
40 -3.248674
41 3.969128
42 2.466126
43 2.198098
44 3.039451
45 -0.877202
46 -1.491813
47 2.216161
48 2.385236
49 -2.318277
50 0.225700
51 -1.985130
52 -4.387124
53 -1.914834
54 1.390379
55 1.404121


C:\Users\Aurora\Anaconda3\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning:
use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
C:\Users\Aurora\Anaconda3\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning:
use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.

# Create and run the app
import nest_asyncio
nest_asyncio.apply()
app = App(app_ui, server)
app.run(host="127.0.0.1", port=8273)Output analysis
Based on the results, we observe that the p-values for happy, fair, and trust are all below or around 0.05, indicating that mobility has a statistically significant impact on these variables. Notably, the coefficient for fair is positive(0.25), suggesting that as mobility increases, individuals perceive a greater sense of fairness in societal interactions. This could indicate that greater residential mobility exposes individuals to more diverse environments, fostering a sense of equitable opportunity and fairness in society.
On the other hand, while the literature highlights a positive link between mobility and happiness, our results show that the coefficients for happy and trust are negative(-0.58 and -0.26), implying that increased residential mobility correlates with a decrease in reported happiness and trust. This finding reveals a potential social insight: while mobility may broaden individuals’ exposure and opportunities, it can also disrupt social ties and create instability, leading to feelings of alienation or a loss of trust in others. These results suggest a dual-edged nature of mobility—it can promote perceptions of fairness while simultaneously eroding the emotional and social foundations of happiness and trust. This highlights the need for policymakers to balance efforts to increase mobility with initiatives that strengthen community cohesion and social support systems.
From the results of the histogram and raincloud plots, we uncover insights that diverge from those reported in the existing literature. While there is no cherry-picking issue in the literature, it does exist in our study.
Bar Chart
Blue bars represent our focal variables, showing the strongest effects, consistently ranking from 1st to 19th. In contrast, proximal-other variables (green bars), which are more strongly correlated with the focal variables, display a wider range of effect sizes, ranking from 2nd to near the bottom. Similarly, distal-other variables (red bars), which are less correlated with the focal variables, exhibit a similar wide range, ranking from 2nd to last.
Notably, the focal variables have a smaller range of effect sizes and higher t-values overall, suggesting a stronger and more consistent influence. However, this does not entirely rule out the possibility of cherry-picking.
Raincloud Plot
The raincloud plot further tests whether the mean effect-size distribution associated with our focal variables differs from that of other dependent variables.
A permutation-based comparison of effect sizes revealed a significant difference between the focal variables and all other groups. This finding suggests that cherry-picking might be present, as the observed differences indicate a non-random selection of variables.
Policy Implications
The findings suggest that residential mobility has a complex and dual-edged impact on individual and societal well-being. The positive correlation between mobility and perceived fairness indicates that mobility can create an environment where individuals experience greater equity and societal justice. Policymakers can leverage this by encouraging mobility through programs such as job relocation assistance, education grants, or housing subsidies that promote equitable access to opportunities.
However, the negative impact of mobility on happiness and trust raises concerns about the social and psychological costs of increased mobility. As individuals move more frequently, the erosion of trust and community ties can lead to feelings of alienation and lower overall life satisfaction. Policymakers must address these unintended consequences by fostering stronger social networks and support systems. For example:
Community-Building Programs: Investing in initiatives that strengthen community bonds, such as local engagement programs, shared public spaces, or neighborhood events, can help mitigate the social disconnection caused by mobility.
Mental Health Support: Introducing mental health resources and support systems for individuals experiencing mobility-related stress or loneliness could alleviate negative emotional outcomes.
Trust-Building Campaigns: Promoting social trust through targeted campaigns, inclusive policies, and fostering shared cultural or civic values may counteract the erosion of interpersonal trust.
These findings highlight the importance of designing mobility-enhancing policies that are socially conscious and proactive in addressing the trade-offs between opportunity creation and emotional well-being.
Directions for Future Work
While this study sheds light on the impact of residential mobility on fairness, happiness, and trust, there are several areas for future exploration:
Longitudinal Analysis: Since we only study patterns across 1978-2018, investigating a longer-term effects of mobility on well-being continuously could provide deeper insights into whether the observed impacts persist, diminish, or intensify over time.
Geographical Variations: Analyzing mobility’s effects in urban vs. rural settings or across different socioeconomic strata could uncover regional or demographic differences in outcomes.
Mechanisms Behind Trust Decline: Future research could focus on understanding the mechanisms that drive the decline in trust due to mobility. For example, does it stem from weaker interpersonal relationships, cultural dissonance, or perceived competition?
Experimental Approaches: Designing controlled experiments or natural experiments, for example, randomly assigning residents to move from one place to another to avoid pre-existing different baseline personal characteristics and validate causal relationships between mobility and personal characteristics could strengthen the robustness of these findings.
By addressing these areas, future work can provide more granular and actionable insights for policymakers, ensuring that residential mobility fosters not only equitable opportunities but also emotional and social well-being.